Source Code: https://github.com/djlofland/DATA622_MachineLearning/tree/master/Homework2

Instructions

Let’s use the Penguin dataset for our assignment. To learn more about the dataset, please visit: https://allisonhorst.github.io/palmerpenguins/articles/intro.html

We will be working with the Penguin dataset again as we did for Homework #1. Please use “Species” as your target variable. For this assignment, you may want to drop/ignore the variable “year”.

Using the target variable, Species, please conduct:

  1. Linear Discriminant Analysis (30 points):
  1. You want to evaluate all the ‘features’ or dependent variables and see what should be in your model. Please comment on your choices.

  2. Just a suggestion: You might want to consider exploring featurePlot on the caret package. Basically, you look at each of the features/dependent variables and see how they are different based on species. Simply eye-balling this might give you an idea about which would be strong ‘classifiers’ (aka predictors).

  3. Fit your LDA model using whatever predictor variables you deem appropriate. Feel free to split the data into training and test sets before fitting the model.

  4. Look at the fit statistics/ accuracy rates.

  1. Quadratic Discriminant Analysis (30 points)
  1. Same steps as above to consider
  1. Naïve Bayes (30 points)
  1. Same steps as above to consider
  1. Comment on the models fits/strength/weakness/accuracy for all these three models that you worked with. (10 points)

Load Data & EDA

# Load dataframe
df <- penguins %>% 
  as_tibble()

# drop the year column
df <- df %>% 
  dplyr::select(-year)

# show summary infio
skim(df)
Data summary
Name df
Number of rows 344
Number of columns 7
_______________________
Column type frequency:
factor 3
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
species 0 1.00 FALSE 3 Ade: 152, Gen: 124, Chi: 68
island 0 1.00 FALSE 3 Bis: 168, Dre: 124, Tor: 52
sex 11 0.97 FALSE 2 mal: 168, fem: 165

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm 2 0.99 43.92 5.46 32.1 39.23 44.45 48.5 59.6 ▃▇▇▆▁
bill_depth_mm 2 0.99 17.15 1.97 13.1 15.60 17.30 18.7 21.5 ▅▅▇▇▂
flipper_length_mm 2 0.99 200.92 14.06 172.0 190.00 197.00 213.0 231.0 ▂▇▃▅▂
body_mass_g 2 0.99 4201.75 801.95 2700.0 3550.00 4050.00 4750.0 6300.0 ▃▇▆▃▂

Plan of Attack

Our data set has the following independent features:

  • Geography - island (Biscoe, Dream or TorgerSen)
  • Bill Dimensions - bill_length_mm and bill_depth_mm
  • Flipper Dimensions - flipper_length_mm
  • Overall Weight - body_weight_g
  • Gender - sex
  • Year - year … I assume this is the year the penguin was measured

Our Dependent variable is species, a categorical variable with 3 distinct values: Adelie, Chinstrap, and Gentoo.

Per instructions, we will drop the year variable. In theory, if we expected penguin characteristics to be changing over time (weather, access to food, ecological problems, etc), then year would be a potentially useful feature. If we assume there are no dramatic time dependent shifts taking place, then year probably doesn’t provide any additional resolution toward prediction.

Handle NAs

We know from the summary above that we have a few NAs. However, it is useful to know how those NAs cluster before making any decisions on how to handle them.

# show missing data
gg_miss_upset(df)

We have some NAs that need handling. Since 2 records are missing most of their data, I’ll drop those entirely. This leave 9 records missing the sex feature. Since sex might be meaningful we can either KNN impute or drop those rows. Since we only have a few rows, I am going to drop them for convenience. If there had been a larger number then we would have to consider whether to drop the feature. If we chose to impute, the best strategy would be KNN imputing based on other row features; however, while imputing is a “go to” strategy, it does in fact carry the risk of biasing our data by reducing variance. All impute techniques run this risk - when we fill in values with mean, median, or imputed, we are artificially weighing our data. This can work in our favor by, but can also work against us. As such, the decision to impute should be explored in more depth given the specific problem.

# remove rows with NAs
df <- na.omit(df)

Dummy Code Categorical

Our island feature and species outcome are categorical - for machine learning, we really need everything coded as numeric. I will do a simple dummy encoding for island which will generate 3 new binary columns, one for each island. Note, dummy coding can be done with or without an intercept. I’ll skip so we will have 3 columns, not 2. For species prediction, we prefer to only have a single column. A typical strategy is to code with a numerical substitute for character values. Since we have 3 outcomes, we can map these to 1, 2 and 3.

# convert species to a factor
species <- as.factor(df$species)

# dummy encode our categorical `island`
dummies <- dummyVars(species ~ ., drop2nd=TRUE, data = df)
df <- data.frame(predict(dummies, newdata = df))
## Warning in model.frame.default(Terms, newdata, na.action = na.action, xlev =
## object$lvls): variable 'species' is not a factor
df$species <- species

skim(df)
Data summary
Name df
Number of rows 333
Number of columns 10
_______________________
Column type frequency:
factor 1
numeric 9
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
species 0 1 FALSE 3 Ade: 146, Gen: 119, Chi: 68

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
island.Biscoe 0 1 0.49 0.50 0.0 0.0 0.0 1.0 1.0 ▇▁▁▁▇
island.Dream 0 1 0.37 0.48 0.0 0.0 0.0 1.0 1.0 ▇▁▁▁▅
island.Torgersen 0 1 0.14 0.35 0.0 0.0 0.0 0.0 1.0 ▇▁▁▁▁
bill_length_mm 0 1 43.99 5.47 32.1 39.5 44.5 48.6 59.6 ▃▇▇▆▁
bill_depth_mm 0 1 17.16 1.97 13.1 15.6 17.3 18.7 21.5 ▅▆▇▇▂
flipper_length_mm 0 1 200.97 14.02 172.0 190.0 197.0 213.0 231.0 ▂▇▃▅▃
body_mass_g 0 1 4207.06 805.22 2700.0 3550.0 4050.0 4775.0 6300.0 ▃▇▅▃▂
sex.female 0 1 0.50 0.50 0.0 0.0 0.0 1.0 1.0 ▇▁▁▁▇
sex.male 0 1 0.50 0.50 0.0 0.0 1.0 1.0 1.0 ▇▁▁▁▇
rm(dummies)
rm(species)

Multicollinearity

Between feature correlations are a huge problem for most regressions. There are better techniques for which multicollinearity is not an issue. The problem with multicollinearity is that our algorithms cannot tease out which variables are contributing to the outcome when they are correlated. When we see correlated features, we need to arbitrarily remove some or reach for a technique like PCA which transforms the features into a new dimensional space with complete independence. The problem with PCA is that we loose all explainibility of the features in our model. Typically we do a forward or reverse step-wise approach to select the minimal set of features which provide the best model performance.

# Calculate and plot the Multicollinearity
correlation = cor(df[,-10], use = 'pairwise.complete.obs')

corrplot(correlation, 'ellipse', type = 'lower', order = 'hclust',
         col=brewer.pal(n=8, name="RdYlBu"))

rm(correlation)
# Show feature correlations/target by decreasing correlation
df$speciesNumeric <- as.numeric(df$species)
stack(sort(cor(df[,1:9], df[, 11])[,], decreasing=TRUE))
##        values               ind
## 1  0.85073678 flipper_length_mm
## 2  0.75043358       body_mass_g
## 3  0.73054767    bill_length_mm
## 4  0.59652777     island.Biscoe
## 5  0.01096361          sex.male
## 6 -0.01096361        sex.female
## 7 -0.31536361      island.Dream
## 8 -0.41931810  island.Torgersen
## 9 -0.74034609     bill_depth_mm

We see some strong correlations between the features - this multicollinearity will be an issues with any standard linear regression approaches. If we are wanting the best model possible, we might need to consider an approach like PCA to reframe our features into a set that are independent. We also notice in this plot that bill_length_mm, body_mass_g, and flipper_length_mm are more strongly correlated. We also see these features positively correlated with both Biscoe Island and being male. Next, we see that flipper_length_mm, body_mass-g, bill_length_mm, and island.Biscoe have the highest correlations with the target species.

Feature-Target Correlation

For regressions, we need to ensure there is a direct relationship between changes in each feature and our target outcome. Ideally, there should be a linear relationship. We can employ transformations, BoxCox, etc to help tease variables which don’t show a linear relationship.

Now lets explore any correlations between features and with the target. Note I separate Male and Female into separate chart.

males <- df %>% 
  dplyr::filter(sex.male == 1) %>% 
  dplyr::select(-c(sex.male, sex.female))

featurePlot(x=males[,1:7], y=males[,8], plot="pairs", auto.key=list(columns=3))

females <- df %>% 
  dplyr::filter(sex.female == 1) %>% 
  dplyr::select(-c(sex.male, sex.female))

featurePlot(x=females[,1:7], y=females[,8], plot="pairs", auto.key=list(columns=3))

rm(males)
rm(females)

Training-Test Split

We were not given a separate hold out group - so to better measure model performance against unseen samples, I’ll set aside 20% of our initial data as a holdout Test group. While this may slightly lower our training performance (few samples means less information captured by the model), the trade-off is that we can identify under- and over-fitting and better measure model performance.

# --- Setup Training and Test datasets
set.seed(3456)
trainIndex <- createDataPartition(df$species, p = .8, list = FALSE, times = 1)

pengTrain <- df[ trainIndex,]
pengTest  <- df[-trainIndex,]

# Cross validation setup for caret modeling
ctrl <- trainControl(method = 'cv', 
                     number = 10,
                     classProbs = T,
                     savePredictions = TRUE)

1. Linear Discriminate Analysis

We train our model on the Training Data only, assess performance, then separately predict from the holdout test set and use that for final model performance.
I only included a subset of features based on EDA above: island.Biscoe, bill_length_mm,bill_depth_mm,flipper_length_mm, andbody_mass_g`.

library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
lda_model <- train(species ~ island.Biscoe + bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g, 
               data = pengTrain, 
               method="lda",
               trControl = ctrl
)

summary(lda_model)
##             Length Class      Mode     
## prior        3     -none-     numeric  
## counts       3     -none-     numeric  
## means       15     -none-     numeric  
## scaling     10     -none-     numeric  
## lev          3     -none-     character
## svd          2     -none-     numeric  
## N            1     -none-     numeric  
## call         3     -none-     call     
## xNames       5     -none-     character
## problemType  1     -none-     character
## tuneValue    1     data.frame list     
## obsLevels    3     -none-     character
## param        0     -none-     list
varImp(lda_model)
## ROC curve variable importance
## 
##   variables are sorted by maximum importance across the classes
##                   Adelie Chinstrap Gentoo
## island.Biscoe       0.00    100.00 100.00
## flipper_length_mm  99.75     91.61  99.75
## bill_length_mm     92.73     92.73  90.85
## bill_depth_mm      88.92     92.48  92.48
## body_mass_g        85.61     89.32  89.32
# --- Evaluate our model
train_pred <- as.factor(predict(lda_model, newdata = pengTrain))
confusionMatrix(train_pred, pengTrain$species)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie       116         2      0
##   Chinstrap      1        53      0
##   Gentoo         0         0     96
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9888          
##                  95% CI : (0.9676, 0.9977)
##     No Information Rate : 0.4366          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9825          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 0.9915           0.9636        1.0000
## Specificity                 0.9868           0.9953        1.0000
## Pos Pred Value              0.9831           0.9815        1.0000
## Neg Pred Value              0.9933           0.9907        1.0000
## Prevalence                  0.4366           0.2052        0.3582
## Detection Rate              0.4328           0.1978        0.3582
## Detection Prevalence        0.4403           0.2015        0.3582
## Balanced Accuracy           0.9891           0.9795        1.0000
test_pred <- as.factor(predict(lda_model, newdata = pengTest))
confusionMatrix(test_pred, pengTest$species)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        28         0      0
##   Chinstrap      1        13      0
##   Gentoo         0         0     23
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9846          
##                  95% CI : (0.9172, 0.9996)
##     No Information Rate : 0.4462          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9759          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 0.9655           1.0000        1.0000
## Specificity                 1.0000           0.9808        1.0000
## Pos Pred Value              1.0000           0.9286        1.0000
## Neg Pred Value              0.9730           1.0000        1.0000
## Prevalence                  0.4462           0.2000        0.3538
## Detection Rate              0.4308           0.2000        0.3538
## Detection Prevalence        0.4308           0.2154        0.3538
## Balanced Accuracy           0.9828           0.9904        1.0000
# save final performance for later comparison
modelValues <- data.frame(obs = pengTest$species, pred=test_pred)
colnames(modelValues) = c('obs', 'pred')
(lda_values <- defaultSummary(modelValues))
##  Accuracy     Kappa 
## 0.9846154 0.9759437

2. Quadratic Discriminant Analysis

Note, I had problems running straight qda() within caret and train(). So, I went ahead and used stepDQA which applies a stepwise feature selection and evaluation using QDA.

qda_model <- train(species ~ island.Biscoe + bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g, 
                   data = pengTrain, 
                   method="stepQDA",
                   trControl = ctrl,
                   metric = "Accuracy"
)
##  `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 242 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.82617;  in: "flipper_length_mm";  variables (1): flipper_length_mm 
## correctness rate: 0.94617;  in: "bill_length_mm";  variables (2): flipper_length_mm, bill_length_mm 
## 
##  hr.elapsed min.elapsed sec.elapsed 
##       0.000       0.000       0.158
##  `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 241 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.81333;  in: "flipper_length_mm";  variables (1): flipper_length_mm 
## correctness rate: 0.9585;  in: "bill_length_mm";  variables (2): flipper_length_mm, bill_length_mm 
## 
##  hr.elapsed min.elapsed sec.elapsed 
##       0.000       0.000       0.148
##  `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 241 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.82167;  in: "flipper_length_mm";  variables (1): flipper_length_mm 
## correctness rate: 0.9585;  in: "bill_length_mm";  variables (2): flipper_length_mm, bill_length_mm 
## 
##  hr.elapsed min.elapsed sec.elapsed 
##       0.000       0.000       0.147
##  `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 241 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.81733;  in: "flipper_length_mm";  variables (1): flipper_length_mm 
## correctness rate: 0.94583;  in: "bill_length_mm";  variables (2): flipper_length_mm, bill_length_mm 
## 
##  hr.elapsed min.elapsed sec.elapsed 
##       0.000       0.000       0.148
##  `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 241 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.81717;  in: "flipper_length_mm";  variables (1): flipper_length_mm 
## correctness rate: 0.95417;  in: "bill_length_mm";  variables (2): flipper_length_mm, bill_length_mm 
## 
##  hr.elapsed min.elapsed sec.elapsed 
##       0.000       0.000       0.146
##  `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 242 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.818;  in: "flipper_length_mm";  variables (1): flipper_length_mm 
## correctness rate: 0.95033;  in: "bill_length_mm";  variables (2): flipper_length_mm, bill_length_mm 
## 
##  hr.elapsed min.elapsed sec.elapsed 
##       0.000       0.000       0.145
##  `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 241 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.8175;  in: "flipper_length_mm";  variables (1): flipper_length_mm 
## correctness rate: 0.95883;  in: "bill_length_mm";  variables (2): flipper_length_mm, bill_length_mm 
## 
##  hr.elapsed min.elapsed sec.elapsed 
##       0.000       0.000       0.143
##  `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 242 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.818;  in: "flipper_length_mm";  variables (1): flipper_length_mm 
## correctness rate: 0.95033;  in: "bill_length_mm";  variables (2): flipper_length_mm, bill_length_mm 
## 
##  hr.elapsed min.elapsed sec.elapsed 
##       0.000       0.000       0.142
##  `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 241 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.82583;  in: "flipper_length_mm";  variables (1): flipper_length_mm 
## correctness rate: 0.95;  in: "bill_length_mm";  variables (2): flipper_length_mm, bill_length_mm 
## 
##  hr.elapsed min.elapsed sec.elapsed 
##       0.000       0.000       0.137
##  `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 240 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.825;  in: "flipper_length_mm";  variables (1): flipper_length_mm 
## correctness rate: 0.95417;  in: "bill_length_mm";  variables (2): flipper_length_mm, bill_length_mm 
## 
##  hr.elapsed min.elapsed sec.elapsed 
##       0.000       0.000       0.137
##  `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 268 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.81752;  in: "flipper_length_mm";  variables (1): flipper_length_mm 
## correctness rate: 0.94786;  in: "bill_length_mm";  variables (2): flipper_length_mm, bill_length_mm 
## 
##  hr.elapsed min.elapsed sec.elapsed 
##       0.000       0.000       0.139
summary(qda_model)
##                     Length Class      Mode     
## call                6      -none-     call     
## method              1      -none-     character
## start.variables     0      -none-     NULL     
## process             4      data.frame list     
## model               2      data.frame list     
## result.pm           2      -none-     numeric  
## runtime             3      -none-     numeric  
## performance.measure 1      -none-     character
## formula             3      formula    call     
## fit                 8      qda        list     
## xNames              5      -none-     character
## problemType         1      -none-     character
## tuneValue           2      data.frame list     
## obsLevels           3      -none-     character
## param               0      -none-     list
varImp(qda_model)
## ROC curve variable importance
## 
##   variables are sorted by maximum importance across the classes
##                   Adelie Chinstrap Gentoo
## island.Biscoe       0.00    100.00 100.00
## flipper_length_mm  99.75     91.61  99.75
## bill_length_mm     92.73     92.73  90.85
## bill_depth_mm      88.92     92.48  92.48
## body_mass_g        85.61     89.32  89.32
# --- Evaluate our model
train_pred <- as.factor(predict(qda_model, newdata = pengTrain))
confusionMatrix(train_pred, pengTrain$species)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie       112         5      0
##   Chinstrap      3        48      1
##   Gentoo         2         2     95
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9515          
##                  95% CI : (0.9185, 0.9739)
##     No Information Rate : 0.4366          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9239          
##                                           
##  Mcnemar's Test P-Value : 0.418           
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 0.9573           0.8727        0.9896
## Specificity                 0.9669           0.9812        0.9767
## Pos Pred Value              0.9573           0.9231        0.9596
## Neg Pred Value              0.9669           0.9676        0.9941
## Prevalence                  0.4366           0.2052        0.3582
## Detection Rate              0.4179           0.1791        0.3545
## Detection Prevalence        0.4366           0.1940        0.3694
## Balanced Accuracy           0.9621           0.9270        0.9832
test_pred <- as.factor(predict(qda_model, newdata = pengTest))
confusionMatrix(test_pred, pengTest$species)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        28         1      0
##   Chinstrap      0        12      1
##   Gentoo         1         0     22
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9538         
##                  95% CI : (0.871, 0.9904)
##     No Information Rate : 0.4462         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9274         
##                                          
##  Mcnemar's Test P-Value : 0.3916         
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 0.9655           0.9231        0.9565
## Specificity                 0.9722           0.9808        0.9762
## Pos Pred Value              0.9655           0.9231        0.9565
## Neg Pred Value              0.9722           0.9808        0.9762
## Prevalence                  0.4462           0.2000        0.3538
## Detection Rate              0.4308           0.1846        0.3385
## Detection Prevalence        0.4462           0.2000        0.3538
## Balanced Accuracy           0.9689           0.9519        0.9664
# save final performance for later comparison
modelValues <- data.frame(obs = pengTest$species, pred=test_pred)
colnames(modelValues) = c('obs', 'pred')
(qda_values <- defaultSummary(modelValues))
##  Accuracy     Kappa 
## 0.9538462 0.9274013

3. Naive Bayes

nb_model <- train(species ~ island.Biscoe + bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g, 
                  data = pengTrain, 
                  method="nb",
                  rControl = ctrl
)

summary(nb_model)
##             Length Class      Mode     
## apriori     3      table      numeric  
## tables      5      -none-     list     
## levels      3      -none-     character
## call        7      -none-     call     
## x           5      data.frame list     
## usekernel   1      -none-     logical  
## varnames    5      -none-     character
## xNames      5      -none-     character
## problemType 1      -none-     character
## tuneValue   3      data.frame list     
## obsLevels   3      -none-     character
## param       1      -none-     list
varImp(nb_model)
## ROC curve variable importance
## 
##   variables are sorted by maximum importance across the classes
##                   Adelie Chinstrap Gentoo
## island.Biscoe       0.00    100.00 100.00
## flipper_length_mm  99.75     91.61  99.75
## bill_length_mm     92.73     92.73  90.85
## bill_depth_mm      88.92     92.48  92.48
## body_mass_g        85.61     89.32  89.32
# --- Evaluate our model
train_pred <- as.factor(predict(nb_model, newdata = pengTrain))
confusionMatrix(train_pred, pengTrain$species)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie       116         5      0
##   Chinstrap      1        50      0
##   Gentoo         0         0     96
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9776          
##                  95% CI : (0.9519, 0.9917)
##     No Information Rate : 0.4366          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9648          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 0.9915           0.9091        1.0000
## Specificity                 0.9669           0.9953        1.0000
## Pos Pred Value              0.9587           0.9804        1.0000
## Neg Pred Value              0.9932           0.9770        1.0000
## Prevalence                  0.4366           0.2052        0.3582
## Detection Rate              0.4328           0.1866        0.3582
## Detection Prevalence        0.4515           0.1903        0.3582
## Balanced Accuracy           0.9792           0.9522        1.0000
test_pred <- as.factor(predict(nb_model, newdata = pengTest))
confusionMatrix(test_pred, pengTest$species)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        28         3      0
##   Chinstrap      1        10      0
##   Gentoo         0         0     23
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9385         
##                  95% CI : (0.8499, 0.983)
##     No Information Rate : 0.4462         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.902          
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 0.9655           0.7692        1.0000
## Specificity                 0.9167           0.9808        1.0000
## Pos Pred Value              0.9032           0.9091        1.0000
## Neg Pred Value              0.9706           0.9444        1.0000
## Prevalence                  0.4462           0.2000        0.3538
## Detection Rate              0.4308           0.1538        0.3538
## Detection Prevalence        0.4769           0.1692        0.3538
## Balanced Accuracy           0.9411           0.8750        1.0000
# save final performance for later comparison
modelValues <- data.frame(obs = pengTest$species, pred=test_pred)
colnames(modelValues) = c('obs', 'pred')
(nb_values <- defaultSummary(modelValues))
##  Accuracy     Kappa 
## 0.9384615 0.9020347

4. Compare models

(models_summary <- rbind(lda_values, qda_values, nb_values))
##             Accuracy     Kappa
## lda_values 0.9846154 0.9759437
## qda_values 0.9538462 0.9274013
## nb_values  0.9384615 0.9020347

All three models performed extremely well with LDA doing best. Given how well these performed, it might be worth learning these modeling techniques with a different data set where we do see differences and can relate those to how the model work. I would note that the classes had fairly good separation which problably led to good LDA performance. Naive Bayes makes an assumption of independence of features which might explain it’s poorer performance. We didn’t do anything to handle multicolinearity. These models performed sufficiently that I did not go back and do further feature cleanup. If model performance had been lower, I would have considered normalizing my features (center, scale. and potentially BoxCox), but that just wasn’t necessary.